DOC GH17505 Added some links and examples. To little/much/wrong? #17908


Closed
wants to merge 1 commit into from

Conversation

@linebp (Contributor) commented Oct 17, 2017

Before I spend more time on this, I'd like to know if I am doing too much, too little or just plain all wrong.

@codecov bot commented Oct 17, 2017

Codecov Report

Merging #17908 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17908      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50105    50105              
==========================================
- Hits        45715    45706       -9     
- Misses       4390     4399       +9
Flag        Coverage Δ
#multiple   89.03% <ø> (ø) ⬆️
#single     40.31% <ø> (-0.06%) ⬇️

Impacted Files         Coverage Δ
pandas/io/gbq.py       25% <0%> (-58.34%) ⬇️
pandas/core/frame.py   97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bf7f9a...ccfa848. Read the comment docs.


df.groupby(['A', 'B']).sum().reset_index()
``count``, Number of non-NA observations
Contributor

these can be links as well

Contributor Author

The links to the functions available with aggregate? I think that would be a great idea. Where can I find the list of available functions and the shortcuts? I figured that must have been documented elsewhere at some point, but I couldn't find it.

@jreback (Contributor) commented Oct 17, 2017

Can you post a rendered version of this page? Since you are doing lots of changes, it's hard to see what the new version would look like.

@linebp (Contributor, Author) commented Oct 19, 2017


Group By: split-apply-combine


Split-apply-combine is a common paradigm in data analysis. It involves splitting
the data set into smaller groups, applying some operation to each group
independently and combining the results into a data structure. This strategy is
supported, for example, by Excel's pivot tables, SQL's GROUP BY operator and R's
plyr package. This section will look at the Pandas ``groupby`` and
related functions and show you how to do split-apply-combine in Pandas. See the
:ref:`cookbook <cookbook.grouping>` for some advanced strategies.

The split step is the most straightforward. See the section on
:ref:`splitting <groupby.split>` below.
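For illustration, a minimal sketch of the split step on a made-up DataFrame (the names here are purely illustrative):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'y', 'x', 'y'], 'B': [1, 2, 3, 4]})
    grouped = df.groupby('A')   # a GroupBy object; this only describes the splitting, nothing is computed yet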

In the apply step you may wish to apply one of the following operations:

  • Aggregate: Get a single value for each group. This could be a summary
    statistic like the sum or mean of some column, or a count of the number of members
    in the group. See the section on :ref:`aggregating <groupby.aggregate>` below.
  • Filter: Keep a subset of your original data. Discard data
    according to some function applied to each group. This can be useful when,
    for example, you wish to discard groups with a low member count. See the section on
    :ref:`filtering <groupby.filter>` below.
  • Transform: Compute a new value for each original row. This can be used to
    normalize/scale data or to fill in erroneous or missing values. See the
    section on :ref:`transforming <groupby.transform>` below.

Pandas has direct support for these three operations and will try to return a
sensibly combined result; a short sketch of all three follows below. See
`when to use aggregate/filter/transform in Pandas <https://pythonforbiologists.com/when-to-use-aggregatefiltertransform-in-pandas/>`_
for further help.
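A minimal sketch of the three operations on a small made-up DataFrame (``df`` here is a toy example, not a real dataset):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                       'B': [1, 2, 3, 4, 5]})

    df.groupby('A').aggregate('sum')                     # aggregate: one value per group
    df.groupby('A').filter(lambda g: len(g) > 2)         # filter: keep only groups with more than two rows
    df.groupby('A').transform(lambda g: g - g.mean())    # transform: one value per original row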

Pandas also supports iteration over the groups created in the split step. Using
iteration over the groups (rather than the three shortcut functions) gives
more control over the apply and combine parts of the process, but also requires
more work from the programmer. See the section on
:ref:`iterating <groupby.iterating>` below.
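A minimal sketch of doing the apply and combine steps by hand through iteration (again on a made-up DataFrame):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3]})

    pieces = []
    for name, group in df.groupby('A'):   # each iteration yields the group key and its sub-DataFrame
        pieces.append(group['B'] * 2)     # the "apply" step, done by hand
    result = pd.concat(pieces)            # the "combine" step, also by hand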

@linebp (Contributor, Author) commented Oct 19, 2017

Aggregation

This section describes how to aggregate data. We will be giving examples using the ``tips.csv`` dataset. Each row represents a meal at some restaurant; the columns store the value of the total bill, the size of the tip and some metadata about the customer.

In [54]: tips = pd.read_csv('./data/tips.csv')

In [55]: tips
Out[55]: 
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
5         25.29  4.71    Male     No   Sun  Dinner     4
6          8.77  2.00    Male     No   Sun  Dinner     2
..          ...   ...     ...    ...   ...     ...   ...
237       32.83  1.17    Male    Yes   Sat  Dinner     2
238       35.83  4.67  Female     No   Sat  Dinner     3
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]

What if we wanted to know the average total bill on each day? We split the data so that each group consists of all the meals eaten on the same day. We want a single value for each group, so we should use the aggregate function:

In [56]: tips.groupby('day').aggregate('mean')
Out[56]: 
      total_bill       tip      size
day
Fri    17.151579  2.734737  2.105263
Sat    20.441379  2.993103  2.517241
Sun    21.410000  3.255132  2.842105
Thur   17.682742  2.771452  2.451613

The result has the group names, in this case the days, as the index along the grouped axis. Along the other axis we have the columns for which Pandas could calculate a mean, i.e. the ones with a numeric data type. We could have selected the ``total_bill`` column either before or after aggregating to limit the result to this column, but not before splitting, since we need the ``day`` column for the splitting.
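For instance, a minimal sketch of both options using the ``tips`` DataFrame from above (output omitted):

    tips.groupby('day')['total_bill'].mean()   # select the column first, then aggregate
    tips.groupby('day').mean()['total_bill']   # aggregate everything, then select (newer pandas may need mean(numeric_only=True))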

How about the number of guests for each day and for each time of day? In this case it is not enough to split the data on the day it was eaten; we also need to split by the time of day. Instead of calculating the mean, as in the previous example, we use the ``sum`` function.

In [57]: tips.groupby(['day', 'time'])['size'].agg('sum')
Out[57]: 
day   time
Fri   Dinner     26
      Lunch      14
Sat   Dinner    219
Sun   Dinner    216
Thur  Dinner      2
      Lunch     150
Name: size, dtype: int64

``agg`` is short for ``aggregate``.
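So the previous aggregation could equally be spelled with the long name; a minimal sketch (output omitted):

    tips.groupby(['day', 'time'])['size'].aggregate('sum')   # identical to .agg('sum') above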

Pandas has support for a number of basic descriptive statistics functions which can be used with aggregate (a short example follows the list):

``count``, Number of non-NA observations
``sum``, Sum of values
``mean``, Mean of values
``mad``, Mean absolute deviation
``median``, Arithmetic median of values
``min``, Minimum
``max``, Maximum
``mode``, Mode
``abs``, Absolute value
``prod``, Product of values
``std``, Bessel-corrected sample standard deviation
``var``, Unbiased variance
``sem``, Standard error of the mean
``skew``, Sample skewness (3rd moment)
``kurt``, Sample kurtosis (4th moment)
``quantile``, Sample quantile (value at %)
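A minimal sketch passing a few of these names to ``agg``, again on the ``tips`` data (output omitted):

    tips.groupby('day')['tip'].agg('mean')                    # a single function, by name
    tips.groupby('day')['tip'].agg(['count', 'mean', 'std'])  # several at once; one result column per function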

What if we need to know the difference between the smallest and largest total bill for each party size? Again we split the data, this time so that each group holds the meals eaten by parties of the same size. But which function do we use to find the difference? The ``agg`` function also accepts a function as argument. The function is called with the values of each column for each group and should return a single value.

In [58]: tips.groupby(['size']).agg(lambda group: max(group) - min(group))['total_bill']
Out[58]: 
size
1     7.00
2    34.80
3    40.48
4    31.84
5    20.50
6    21.12
Name: total_bill, dtype: float64

@linebp (Contributor, Author) commented Oct 19, 2017

The formatting is not great, but I hope it gives a better idea of what's been changed and how it would look.
I pasted the two sections I rewrote above; it should be obvious where they belong in the original?

Is there a better way to do this?

@jreback (Contributor) commented Nov 23, 2017

@linebp can you post a rendered screenshot of this?

@jreback (Contributor) commented Jan 21, 2018

can you rebase and show a rendered screenshot?

@jreback (Contributor) commented Feb 24, 2018

@linebp sorry, we let this get away from us. Happy to have some clarifications, but can you do it in a more targeted manner? IOW, more PRs with smaller changes is usually better.

@jreback closed this Feb 24, 2018
@linebp (Contributor, Author) commented Feb 28, 2018

I'll have a look at it again and see if I can do a PR with a smaller change and do a screenshot of the rendered changes.

Development

Successfully merging this pull request may close these issues.

DOC: nice links / examples for setting with copy & aggregation